The objective of this analytical report is to help companies identify good employees who are at risk of leaving. With this information, companies can allocate their finances and resources to the areas that help retain good employees.
First, we will analyze and visualize the data to get a basic understanding of the data in hand (Human Resources Analytics by Ludovic Benistant, from kaggle.com). We will then check the correlations among the factors to identify and interpret the key factors that drive employees to leave.
Second, we will segment the employees using hierarchical clustering on their attributes.
Finally, we will bucket the employees who have stayed across two dimensions, performance and risk of leaving, in order to predict and identify the employees that companies generally wish to retain even at a higher cost: high-performing employees with a high risk of leaving (and perhaps also the low-performing employees with a low possibility of leaving). This will help the company target its investment in human resources and reduce the risk and negative impact of losing high-performing employees.
First we load the data to use.
```r
# Read the HR dataset and convert it to a numeric matrix
ProjectData <- read.csv("./data/HR_data.csv")
ProjectData <- data.matrix(ProjectData)
```
Description of data:

- Employee satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department
- Salary (1=low, 2=medium, 3=high)
- Whether the employee has left
Here are the first 10 observations (one column per observation):
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| satisfaction_level | 0.38 | 0.80 | 0.11 | 0.72 | 0.37 | 0.41 | 0.10 | 0.92 | 0.89 | 0.42 |
| last_evaluation | 0.53 | 0.86 | 0.88 | 0.87 | 0.52 | 0.50 | 0.77 | 0.85 | 1.00 | 0.53 |
| number_project | 2.00 | 5.00 | 7.00 | 5.00 | 2.00 | 2.00 | 6.00 | 5.00 | 5.00 | 2.00 |
| average_montly_hours | 157.00 | 262.00 | 272.00 | 223.00 | 159.00 | 153.00 | 247.00 | 259.00 | 224.00 | 142.00 |
| time_spend_company | 3.00 | 6.00 | 4.00 | 5.00 | 3.00 | 3.00 | 4.00 | 5.00 | 5.00 | 3.00 |
| Work_accident | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| left | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| salary_level | 1.00 | 2.00 | 2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| sales | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| accounting | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| hr | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| technical | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| support | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| management | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| IT | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| product_mng | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| marketing | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| RandD | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
The data we use here have the following descriptive statistics.
| | min | 25 percent | median | mean | 75 percent | max | std |
|---|---|---|---|---|---|---|---|
| satisfaction_level | 0.09 | 0.44 | 0.64 | 0.61 | 0.82 | 1 | 0.25 |
| last_evaluation | 0.36 | 0.56 | 0.72 | 0.72 | 0.87 | 1 | 0.17 |
| number_project | 2.00 | 3.00 | 4.00 | 3.80 | 5.00 | 7 | 1.23 |
| average_montly_hours | 96.00 | 156.00 | 200.00 | 201.05 | 245.00 | 310 | 49.94 |
| time_spend_company | 2.00 | 3.00 | 3.00 | 3.50 | 4.00 | 10 | 1.46 |
| Work_accident | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 1 | 0.35 |
| left | 0.00 | 0.00 | 0.00 | 0.24 | 0.00 | 1 | 0.43 |
| promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 1 | 0.14 |
| salary_level | 1.00 | 1.00 | 2.00 | 1.59 | 2.00 | 3 | 0.64 |
| sales | 0.00 | 0.00 | 0.00 | 0.28 | 1.00 | 1 | 0.45 |
| accounting | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| hr | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| technical | 0.00 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.39 |
| support | 0.00 | 0.00 | 0.00 | 0.15 | 0.00 | 1 | 0.36 |
| management | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 1 | 0.20 |
| IT | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 | 1 | 0.27 |
| product_mng | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.24 |
| marketing | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.23 |
| RandD | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
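A summary table of this shape can be computed column by column. The sketch below is one possible way to do it, not necessarily the code behind the table above; the helper `summarize_columns` and the `toy` matrix are hypothetical and the toy values are illustrative only.

```r
# Hypothetical helper: one row of summary statistics per column of a numeric matrix.
summarize_columns <- function(x) {
  stats <- t(apply(x, 2, function(col) {
    c(min          = min(col),
      `25 percent` = unname(quantile(col, 0.25)),
      median       = median(col),
      mean         = mean(col),
      `75 percent` = unname(quantile(col, 0.75)),
      max          = max(col),
      std          = sd(col))
  }))
  round(stats, 2)
}

# Toy data standing in for ProjectData (illustrative values, not the real dataset).
toy <- cbind(satisfaction_level = c(0.38, 0.80, 0.11, 0.72),
             number_project     = c(2, 5, 7, 5))

toy_summary <- summarize_columns(toy)
print(toy_summary)
```

Applied to the full data matrix, the same helper would return one row per variable, in the layout used above.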
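Because the variables are measured on very different scales, we rescale every variable to the [0, 1] range (min-max scaling):

```r
# Min-max scaling, applied column by column: (value - min) / (max - min)
ProjectDataFactor_scaled <- apply(ProjectDataFactor, 2, function(r) {
  (r - min(r)) / (max(r) - min(r))
})
```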
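As a quick check of the scaling rule, the sketch below applies the same formula to a small toy matrix. The `toy` matrix is hypothetical; its values were picked so that `number_project = 5` maps to (5 - 2) / (7 - 2) = 0.6, which matches the 75th percentile reported for that variable in the scaled summary.

```r
# Toy matrix (illustrative values); columns play the role of the report's variables.
toy <- cbind(number_project       = c(2, 5, 7),
             average_montly_hours = c(96, 203, 310))

# Same min-max rule as in the report: (value - min) / (max - min), column by column.
toy_scaled <- apply(toy, 2, function(r) (r - min(r)) / (max(r) - min(r)))
print(toy_scaled)

# Caveat: a constant column would give 0 / 0 = NaN, so it would need separate handling.
```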
Notice now the summary statistics of the scaled dataset:
| | min | 25 percent | median | mean | 75 percent | max | std |
|---|---|---|---|---|---|---|---|
| satisfaction_level | 0 | 0.38 | 0.60 | 0.57 | 0.80 | 1 | 0.27 |
| last_evaluation | 0 | 0.31 | 0.56 | 0.56 | 0.80 | 1 | 0.27 |
| number_project | 0 | 0.20 | 0.40 | 0.36 | 0.60 | 1 | 0.25 |
| average_montly_hours | 0 | 0.28 | 0.49 | 0.49 | 0.70 | 1 | 0.23 |
| time_spend_company | 0 | 0.12 | 0.12 | 0.19 | 0.25 | 1 | 0.18 |
| Work_accident | 0 | 0.00 | 0.00 | 0.14 | 0.00 | 1 | 0.35 |
| left | 0 | 0.00 | 0.00 | 0.24 | 0.00 | 1 | 0.43 |
| promotion_last_5years | 0 | 0.00 | 0.00 | 0.02 | 0.00 | 1 | 0.14 |
| salary_level | 0 | 0.00 | 0.50 | 0.30 | 0.50 | 1 | 0.32 |
| sales | 0 | 0.00 | 0.00 | 0.28 | 1.00 | 1 | 0.45 |
| accounting | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| hr | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| technical | 0 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.39 |
| support | 0 | 0.00 | 0.00 | 0.15 | 0.00 | 1 | 0.36 |
| management | 0 | 0.00 | 0.00 | 0.04 | 0.00 | 1 | 0.20 |
| IT | 0 | 0.00 | 0.00 | 0.08 | 0.00 | 1 | 0.27 |
| product_mng | 0 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.24 |
| marketing | 0 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.23 |
| RandD | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
This is the correlation matrix.
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary_level | sales | accounting | hr | technical | support | management | IT | product_mng | marketing | RandD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| satisfaction_level | 1.00 | 0.11 | -0.14 | -0.02 | -0.10 | 0.06 | -0.39 | 0.03 | 0.05 | 0.00 | -0.03 | -0.01 | -0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| last_evaluation | 0.11 | 1.00 | 0.35 | 0.34 | 0.13 | -0.01 | 0.01 | -0.01 | -0.01 | -0.02 | 0.00 | -0.01 | 0.01 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 | -0.01 |
| number_project | -0.14 | 0.35 | 1.00 | 0.42 | 0.20 | 0.00 | 0.02 | -0.01 | 0.00 | -0.01 | 0.00 | -0.03 | 0.03 | 0.00 | 0.01 | 0.00 | 0.00 | -0.02 | 0.01 |
| average_montly_hours | -0.02 | 0.34 | 0.42 | 1.00 | 0.13 | -0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | 0.01 | 0.00 | 0.00 | 0.01 | -0.01 | -0.01 | 0.00 |
| time_spend_company | -0.10 | 0.13 | 0.20 | 0.13 | 1.00 | 0.00 | 0.14 | 0.07 | 0.05 | 0.02 | 0.00 | -0.02 | -0.03 | -0.03 | 0.12 | -0.01 | 0.00 | 0.01 | -0.02 |
| Work_accident | 0.06 | -0.01 | 0.00 | -0.01 | 0.00 | 1.00 | -0.15 | 0.04 | 0.01 | 0.00 | -0.01 | -0.02 | -0.01 | 0.01 | 0.01 | -0.01 | 0.00 | 0.01 | 0.02 |
| left | -0.39 | 0.01 | 0.02 | 0.07 | 0.14 | -0.15 | 1.00 | -0.06 | -0.16 | 0.01 | 0.02 | 0.03 | 0.02 | 0.01 | -0.05 | -0.01 | -0.01 | 0.00 | -0.05 |
| promotion_last_5years | 0.03 | -0.01 | -0.01 | 0.00 | 0.07 | 0.04 | -0.06 | 1.00 | 0.10 | 0.01 | 0.00 | 0.00 | -0.04 | -0.04 | 0.13 | -0.04 | -0.04 | 0.05 | 0.02 |
| salary_level | 0.05 | -0.01 | 0.00 | 0.00 | 0.05 | 0.01 | -0.16 | 0.10 | 1.00 | -0.04 | 0.01 | 0.00 | -0.02 | -0.03 | 0.16 | -0.01 | -0.01 | 0.01 | 0.00 |
| sales | 0.00 | -0.02 | -0.01 | 0.00 | 0.02 | 0.00 | 0.01 | 0.01 | -0.04 | 1.00 | -0.14 | -0.14 | -0.29 | -0.26 | -0.13 | -0.18 | -0.16 | -0.15 | -0.15 |
| accounting | -0.03 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | 0.02 | 0.00 | 0.01 | -0.14 | 1.00 | -0.05 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | -0.05 |
| hr | -0.01 | -0.01 | -0.03 | -0.01 | -0.02 | -0.02 | 0.03 | 0.00 | 0.00 | -0.14 | -0.05 | 1.00 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | -0.05 |
| technical | -0.01 | 0.01 | 0.03 | 0.01 | -0.03 | -0.01 | 0.02 | -0.04 | -0.02 | -0.29 | -0.11 | -0.11 | 1.00 | -0.20 | -0.10 | -0.14 | -0.12 | -0.12 | -0.11 |
| support | 0.01 | 0.02 | 0.00 | 0.00 | -0.03 | 0.01 | 0.01 | -0.04 | -0.03 | -0.26 | -0.10 | -0.10 | -0.20 | 1.00 | -0.09 | -0.12 | -0.11 | -0.10 | -0.10 |
| management | 0.01 | 0.01 | 0.01 | 0.00 | 0.12 | 0.01 | -0.05 | 0.13 | 0.16 | -0.13 | -0.05 | -0.05 | -0.10 | -0.09 | 1.00 | -0.06 | -0.05 | -0.05 | -0.05 |
| IT | 0.01 | 0.00 | 0.00 | 0.01 | -0.01 | -0.01 | -0.01 | -0.04 | -0.01 | -0.18 | -0.07 | -0.07 | -0.14 | -0.12 | -0.06 | 1.00 | -0.08 | -0.07 | -0.07 |
| product_mng | 0.01 | 0.00 | 0.00 | -0.01 | 0.00 | 0.00 | -0.01 | -0.04 | -0.01 | -0.16 | -0.06 | -0.06 | -0.12 | -0.11 | -0.05 | -0.08 | 1.00 | -0.06 | -0.06 |
| marketing | 0.01 | 0.00 | -0.02 | -0.01 | 0.01 | 0.01 | 0.00 | 0.05 | 0.01 | -0.15 | -0.06 | -0.06 | -0.12 | -0.10 | -0.05 | -0.07 | -0.06 | 1.00 | -0.06 |
| RandD | 0.01 | -0.01 | 0.01 | 0.00 | -0.02 | 0.02 | -0.05 | 0.02 | 0.00 | -0.15 | -0.05 | -0.05 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | 1.00 |
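In the matrix above, the variables most correlated with `left` are `satisfaction_level` (-0.39), `salary_level` (-0.16), `Work_accident` (-0.15) and `time_spend_company` (0.14). The sketch below shows one way to pull out and rank such a column from a correlation matrix. It runs on a hypothetical `toy` matrix with illustrative values, so its numbers are not those of the report.

```r
# Toy data (illustrative only): three variables for eight hypothetical employees.
toy <- cbind(satisfaction_level = c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1),
             time_spend_company = c(2, 3, 2, 3, 4, 5, 4, 6),
             left               = c(0, 0, 0, 0, 1, 1, 1, 1))

# Pearson correlation matrix, rounded to two decimals as in the report's table.
cor_matrix <- round(cor(toy), 2)

# Correlation of every other variable with `left`, strongest (in absolute value) first.
cor_with_left <- cor_matrix[, "left"]
cor_with_left <- cor_with_left[names(cor_with_left) != "left"]
cor_with_left <- cor_with_left[order(abs(cor_with_left), decreasing = TRUE)]
print(cor_with_left)
```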
We segment on all the variables except `left` (whether the employee has left), and we measure dissimilarity between employees with Euclidean distance.
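```r
# Columns used to build the segments (column 7, `left`, is excluded)
segmentation_attributes_used <- c(1:6, 8:19)
# Columns used to profile the segments afterwards
profile_attributes_used <- c(1:19)
numb_clusters_used <- 5
profile_with <- "hclust"
distance_used <- "euclidean"
hclust_method <- "ward.D"
```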
Here are the pairwise distances between the first 10 observations, using the distance metric we selected:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0.00 | | | | | | | | | |
| Obs.02 | 1.21 | 0.00 | | | | | | | | |
| Obs.03 | 1.39 | 0.89 | 0.00 | | | | | | | |
| Obs.04 | 0.97 | 0.55 | 0.96 | 0.00 | | | | | | |
| Obs.05 | 0.02 | 1.22 | 1.39 | 0.98 | 0.00 | | | | | |
| Obs.06 | 0.06 | 1.23 | 1.43 | 0.99 | 0.06 | 0.00 | | | | |
| Obs.07 | 1.03 | 0.98 | 0.58 | 0.75 | 1.03 | 1.07 | 0.00 | | | |
| Obs.08 | 1.12 | 0.53 | 1.11 | 0.28 | 1.13 | 1.13 | 0.94 | 0.00 | | |
| Obs.09 | 1.17 | 0.60 | 1.12 | 0.28 | 1.18 | 1.19 | 0.97 | 0.29 | 0.00 | |
| Obs.10 | 0.08 | 1.23 | 1.43 | 0.98 | 0.10 | 0.07 | 1.08 | 1.13 | 1.17 | 0.00 |
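To make these numbers concrete, the sketch below recomputes the distance between Obs.01 and Obs.02 from the raw values and the column minima and maxima reported earlier. It assumes the distances were computed on the min-max scaled data. The department indicators are omitted because both observations are in sales, so they contribute nothing to the distance. The variable names `obs`, `col_min` and `col_max` are hypothetical.

```r
# Raw values of the first two observations (from the first-10 table).
obs <- rbind(Obs.01 = c(0.38, 0.53, 2, 157, 3, 0, 0, 1),
             Obs.02 = c(0.80, 0.86, 5, 262, 6, 0, 0, 2))
colnames(obs) <- c("satisfaction_level", "last_evaluation", "number_project",
                   "average_montly_hours", "time_spend_company", "Work_accident",
                   "promotion_last_5years", "salary_level")

# Column minima and maxima from the descriptive statistics table.
col_min <- c(0.09, 0.36, 2,  96,  2, 0, 0, 1)
col_max <- c(1.00, 1.00, 7, 310, 10, 1, 1, 3)

# Min-max scale with the full-sample ranges, then take the Euclidean distance.
obs_scaled <- sweep(sweep(obs, 2, col_min, "-"), 2, col_max - col_min, "/")
d <- dist(obs_scaled, method = "euclidean")
print(round(as.matrix(d), 2))
```

The off-diagonal entry comes out at about 1.21, in line with the Obs.02 / Obs.01 cell of the table.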
We can look at the histograms of, say, the first 2 variables, or at the histogram of all pairwise Euclidean distances.
Let’s use hierarchical clustering. It is useful to look at the dendrogram to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:
We can also plot the “distances” traveled before we need to merge any of the lower, smaller clusters into larger ones: these are the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. With n observations, this plot has n-1 numbers; we show the first 20 here.
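The clustering and the merge heights described above can be sketched as follows, using the Ward method and Euclidean distance named in the settings. The `toy` matrix is hypothetical and much smaller than the real data, and sorting the heights in decreasing order is an assumption about how the plot is built.

```r
# Toy data (illustrative only): two well-separated groups of hypothetical employees.
toy <- rbind(c(0.10, 0.90), c(0.12, 0.88), c(0.15, 0.85),
             c(0.80, 0.20), c(0.82, 0.25), c(0.85, 0.22))

# Pairwise Euclidean distances, then Ward hierarchical clustering.
toy_dist   <- dist(toy, method = "euclidean")
toy_hclust <- hclust(toy_dist, method = "ward.D")

# With n observations there are n - 1 merges; a large jump in height hints at a natural cut.
merge_heights <- rev(sort(toy_hclust$height))
print(round(merge_heights, 2))

# plot(toy_hclust) would draw the dendrogram itself.
```

In this toy case the largest height is the final merge of the two groups, which is why a two-segment cut would be the natural reading of the tree.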
For now, let’s consider the 5-segment solution. Here is the segment each observation (an employee in this case) belongs to, for the first 20 employees:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
| 13 | 1 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 1 |
| 19 | 2 |
| 20 | 1 |
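A membership table like this one can be obtained by cutting the tree at the chosen number of clusters. The following is a minimal sketch on hypothetical toy data, cut into 2 clusters rather than the 5 used in the report; `membership_table` is an illustrative name.

```r
# Toy data (illustrative only): two well-separated groups of hypothetical employees.
toy <- rbind(c(0.10, 0.90), c(0.12, 0.88), c(0.15, 0.85),
             c(0.80, 0.20), c(0.82, 0.25), c(0.85, 0.22))

toy_hclust <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Cut the tree into k clusters; cutree labels clusters in order of first appearance.
cluster_memberships <- cutree(toy_hclust, k = 2)

membership_table <- data.frame(Observation_Number = seq_along(cluster_memberships),
                               Cluster_Membership = cluster_memberships)
print(membership_table)
```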
Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are and interpret the segments.
The average values of our (scaled) data for the total population as well as within each employee segment are:
| | Population | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 |
|---|---|---|---|---|---|---|
| satisfaction_level | 0.57 | 0.57 | 0.57 | 0.57 | 0.58 | 0.58 |
| last_evaluation | 0.56 | 0.55 | 0.56 | 0.56 | 0.57 | 0.56 |
| number_project | 0.36 | 0.36 | 0.36 | 0.38 | 0.36 | 0.36 |
| average_montly_hours | 0.49 | 0.49 | 0.49 | 0.50 | 0.49 | 0.50 |
| time_spend_company | 0.19 | 0.19 | 0.20 | 0.18 | 0.17 | 0.18 |
| Work_accident | 0.14 | 0.14 | 0.15 | 0.14 | 0.15 | 0.13 |
| left | 0.24 | 0.25 | 0.22 | 0.26 | 0.25 | 0.22 |
| promotion_last_5years | 0.02 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 |
| salary_level | 0.30 | 0.27 | 0.34 | 0.28 | 0.27 | 0.29 |
| sales | 0.28 | 1.00 | 0.02 | 0.00 | 0.00 | 0.00 |
| accounting | 0.05 | 0.00 | 0.16 | 0.00 | 0.00 | 0.00 |
| hr | 0.05 | 0.00 | 0.15 | 0.00 | 0.00 | 0.00 |
| technical | 0.18 | 0.00 | 0.01 | 1.00 | 0.00 | 0.00 |
| support | 0.15 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| management | 0.04 | 0.00 | 0.13 | 0.00 | 0.00 | 0.00 |
| IT | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| product_mng | 0.06 | 0.00 | 0.19 | 0.00 | 0.00 | 0.00 |
| marketing | 0.06 | 0.00 | 0.18 | 0.00 | 0.00 | 0.00 |
| RandD | 0.05 | 0.00 | 0.16 | 0.00 | 0.00 | 0.00 |
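The profile table above pairs the population mean of each variable with its mean inside each segment. A minimal sketch of that computation, on hypothetical toy data with a made-up membership vector, could look like this:

```r
# Toy data and memberships (illustrative only).
toy <- cbind(satisfaction_level = c(0.2, 0.4, 0.8, 1.0),
             left               = c(1, 1, 0, 0))
cluster_memberships <- c(1, 1, 2, 2)

# Mean of every variable over the whole population.
population_avg <- colMeans(toy)

# Mean of every variable within each segment (one column per segment).
segments    <- sort(unique(cluster_memberships))
segment_avg <- sapply(segments, function(k) {
  colMeans(toy[cluster_memberships == k, , drop = FALSE])
})
colnames(segment_avg) <- paste("Segment", segments)

profile_table <- cbind(Population = population_avg, segment_avg)
print(round(profile_table, 2))
```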
We can also measure the ratio of each cluster's average to the population average and subtract 1 (i.e. avg(cluster) / avg(population) - 1). Values near 0 mean the segment looks like the population on that variable, and a value of -1 means the segment average is 0. The resulting matrix is:
| | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 |
|---|---|---|---|---|---|
| satisfaction_level | 0.00 | 0.00 | -0.01 | 0.01 | 0.01 |
| last_evaluation | -0.02 | 0.00 | 0.01 | 0.02 | 0.00 |
| number_project | -0.02 | -0.01 | 0.04 | 0.00 | 0.01 |
| average_montly_hours | 0.00 | -0.01 | 0.01 | 0.00 | 0.01 |
| time_spend_company | 0.02 | 0.05 | -0.06 | -0.07 | -0.02 |
| Work_accident | -0.04 | 0.04 | -0.03 | 0.07 | -0.08 |
| left | 0.05 | -0.09 | 0.08 | 0.05 | -0.07 |
| promotion_last_5years | -1.00 | 2.08 | -1.00 | -1.00 | -0.89 |
| salary_level | -0.08 | 0.13 | -0.04 | -0.08 | -0.04 |
| sales | 2.62 | -0.93 | -1.00 | -1.00 | -1.00 |
| accounting | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| hr | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| technical | -1.00 | -0.97 | 4.51 | -1.00 | -1.00 |
| support | -1.00 | -0.97 | -1.00 | 5.73 | -1.00 |
| management | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| IT | -1.00 | -1.00 | -1.00 | -1.00 | 11.22 |
| product_mng | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| marketing | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| RandD | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
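The ratio formula can be applied directly to a profile table. The sketch below uses two rows and two segments copied from the rounded averages reported earlier. Because the inputs are rounded, results can differ in the second decimal from the matrix above, which was presumably computed from unrounded averages.

```r
# Rounded averages for two variables (Population, Segment 1, Segment 2).
profile_table <- rbind(left  = c(0.24, 0.25, 0.22),
                       sales = c(0.28, 1.00, 0.02))
colnames(profile_table) <- c("Population", "Segment 1", "Segment 2")

# avg(cluster) / avg(population) - 1, row by row.
relative_profile <- profile_table[, -1, drop = FALSE] / profile_table[, "Population"] - 1
print(round(relative_profile, 2))
```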
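The segment profiles depend almost entirely on department: Segments 1, 3, 4 and 5 consist only of sales, technical, support and IT employees respectively, and Segment 2 collects nearly everyone else, while the other variables barely differ across segments.
Let’s repeat the analysis without the department information.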
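This time we segment on all the variables except `left` and the department indicators. Note that the index vector below, `c(1:6, 9)`, also leaves out `promotion_last_5years` (column 8). We again use Euclidean distance.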
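```r
# Columns used to build the segments: the first six variables and salary_level
segmentation_attributes_used <- c(1:6, 9)
# Columns used to profile the segments afterwards
profile_attributes_used <- c(1:9)
numb_clusters_used <- 5
profile_with <- "hclust"
distance_used <- "euclidean"
hclust_method <- "ward.D"
```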
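To see which variables these index vectors select, they can be mapped back to column names. The sketch below assumes the columns are ordered as in the tables of this report; `column_names`, `used_for_segmentation` and `profiled_only` are hypothetical names.

```r
# Column order assumed to match the tables in this report.
column_names <- c("satisfaction_level", "last_evaluation", "number_project",
                  "average_montly_hours", "time_spend_company", "Work_accident",
                  "left", "promotion_last_5years", "salary_level",
                  "sales", "accounting", "hr", "technical", "support",
                  "management", "IT", "product_mng", "marketing", "RandD")

segmentation_attributes_used <- c(1:6, 9)
profile_attributes_used      <- c(1:9)

# Variables that drive the clustering.
used_for_segmentation <- column_names[segmentation_attributes_used]
# Variables reported in the profiles but not used to form the clusters.
profiled_only <- setdiff(column_names[profile_attributes_used], used_for_segmentation)

print(used_for_segmentation)
print(profiled_only)
```

Under that assumption, `left` and `promotion_last_5years` are profiled but do not influence how the segments are formed.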
Here are the pairwise distances between the first 10 observations, using the distance metric we selected:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0.00 | | | | | | | | | |
| Obs.02 | 1.21 | 0.00 | | | | | | | | |
| Obs.03 | 1.39 | 0.89 | 0.00 | | | | | | | |
| Obs.04 | 0.97 | 0.55 | 0.96 | 0.00 | | | | | | |
| Obs.05 | 0.02 | 1.22 | 1.39 | 0.98 | 0.00 | | | | | |
| Obs.06 | 0.06 | 1.23 | 1.43 | 0.99 | 0.06 | 0.00 | | | | |
| Obs.07 | 1.03 | 0.98 | 0.58 | 0.75 | 1.03 | 1.07 | 0.00 | | | |
| Obs.08 | 1.12 | 0.53 | 1.11 | 0.28 | 1.13 | 1.13 | 0.94 | 0.00 | | |
| Obs.09 | 1.17 | 0.60 | 1.12 | 0.28 | 1.18 | 1.19 | 0.97 | 0.29 | 0.00 | |
| Obs.10 | 0.08 | 1.23 | 1.43 | 0.98 | 0.10 | 0.07 | 1.08 | 1.13 | 1.17 | 0.00 |
We can look at the histograms of, say, the first 2 variables, or at the histogram of all pairwise Euclidean distances.
Let’s use hierarchical clustering. It is useful to look at the dendrogram to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:
We can also plot the “distances” traveled before we need to merge any of the lower, smaller clusters into larger ones: these are the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. With n observations, this plot has n-1 numbers; we show the first 20 here.
For now, let’s consider the 5-segment solution. Here is the segment each observation (an employee in this case) belongs to, for the first 20 employees:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 1 |
| 6 | 1 |
| 7 | 3 |
| 8 | 4 |
| 9 | 4 |
| 10 | 1 |
| 11 | 1 |
| 12 | 3 |
| 13 | 4 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 4 |
| 19 | 5 |
| 20 | 4 |
Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are and interpret the segments.
The average values of our (scaled) data for the total population as well as within each employee segment are:
| | Population | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 |
|---|---|---|---|---|---|---|
| satisfaction_level | 0.57 | 0.36 | 0.70 | 0.08 | 0.69 | 0.61 |
| last_evaluation | 0.56 | 0.25 | 0.59 | 0.70 | 0.60 | 0.55 |
| number_project | 0.36 | 0.04 | 0.36 | 0.70 | 0.36 | 0.36 |
| average_montly_hours | 0.49 | 0.24 | 0.50 | 0.69 | 0.51 | 0.49 |
| time_spend_company | 0.19 | 0.13 | 0.20 | 0.28 | 0.16 | 0.19 |
| Work_accident | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| left | 0.24 | 0.78 | 0.09 | 0.55 | 0.15 | 0.08 |
| promotion_last_5years | 0.02 | 0.01 | 0.04 | 0.01 | 0.01 | 0.04 |
| salary_level | 0.30 | 0.25 | 0.58 | 0.27 | 0.00 | 0.30 |
We can again measure the ratio of each cluster's average to the population average and subtract 1 (i.e. avg(cluster) / avg(population) - 1), which gives the following matrix:
| | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 |
|---|---|---|---|---|---|
| satisfaction_level | -0.38 | 0.22 | -0.86 | 0.20 | 0.07 |
| last_evaluation | -0.55 | 0.05 | 0.25 | 0.07 | -0.01 |
| number_project | -0.90 | 0.00 | 0.94 | 0.01 | -0.01 |
| average_montly_hours | -0.51 | 0.03 | 0.40 | 0.03 | -0.01 |
| time_spend_company | -0.31 | 0.09 | 0.48 | -0.16 | 0.01 |
| Work_accident | -1.00 | -1.00 | -1.00 | -1.00 | 5.92 |
| left | 2.29 | -0.64 | 1.33 | -0.38 | -0.67 |
| promotion_last_5years | -0.37 | 0.69 | -0.72 | -0.69 | 0.65 |
| salary_level | -0.16 | 0.94 | -0.08 | -1.00 | 0.02 |
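One way to read this matrix against the report's objective is to look for segments that are above the population average on both `left` and `last_evaluation`. The sketch below does that with the two relevant rows copied from the matrix above. Using 0 as the cut-off for "high" and using `last_evaluation` as the performance measure are assumptions made for illustration, not choices stated in the report.

```r
# Two rows copied from the relative-profile matrix above.
relative_profile <- rbind(
  last_evaluation = c(-0.55,  0.05, 0.25,  0.07, -0.01),
  left            = c( 2.29, -0.64, 1.33, -0.38, -0.67)
)
colnames(relative_profile) <- paste("Segment", 1:5)

# Assumed cut-offs: above the population average on each dimension.
high_risk       <- relative_profile["left", ] > 0
high_performing <- relative_profile["last_evaluation", ] > 0

# Segments that are both above-average performers and above-average leavers.
target_segments <- names(which(high_risk & high_performing))
print(target_segments)
```

With these cut-offs only Segment 3 qualifies: it combines above-average evaluations with a leave rate well above the population's, and it also has the lowest satisfaction level of the five segments.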